The Annals of Statistics 

2009, Vol. 37, No. 3, 1566-1590 

DOI: 10.1214/08-AOS622 

© Institute of Mathematical Statistics, 2009 



ON-LINE PREDICTIVE LINEAR REGRESSION 1 

By Vladimir Vovk, Ilia Nouretdinov and Alex Gammerman 

Royal Holloway, University of London 

We consider the on-line predictive version of the standard prob- 
lem of linear regression; the goal is to predict each consecutive re- 
sponse given the corresponding explanatory variables and all the 
previous observations. The standard treatment of prediction in lin- 
ear regression analysis has two drawbacks: (1) the classical prediction 
intervals guarantee that the probability of error is equal to the nominal
significance level $\epsilon$, but this property per se does not imply that
the long-run frequency of error is close to $\epsilon$; (2) it is not suitable
for prediction of complex systems as it assumes that the number of 
observations exceeds the number of parameters. We state a general re- 
sult showing that in the on-line protocol the frequency of error for the 
classical prediction intervals does equal the nominal significance level, 
up to statistical fluctuations. We also describe alternative regression 
models in which informative prediction intervals can be found before 
the number of observations exceeds the number of parameters. One 
of these models, which only assumes that the observations are inde- 
pendent and identically distributed, is popular in machine learning 
but greatly underused in the statistical theory of regression. 

1. Introduction. Let $y_n$, $n = 1, 2, \ldots$, be the sequence of response
variables to be predicted, and let $x_n = (x_{n,1}, \ldots, x_{n,K})$, $n = 1, 2, \ldots$,
be the corresponding vectors of explanatory variables. The standard assumption of
linear regression analysis is that the explanatory vectors are deterministic and

(1) $y_n = \alpha + \beta \cdot x_n + \xi_n$,

where $\alpha$ is an unknown coefficient, $\beta \in \mathbb{R}^K$ is an unknown vector of
coefficients, and $\xi_n$, $n = 1, 2, \ldots$, are IID (independent and identically distributed)
Gaussian random variables with mean 0 and unknown variance $\sigma^2 > 0$ [we
will write $\xi_n \sim N(0, \sigma^2)$]. The model (1) will be called the Gauss linear
model. It is the standard textbook model.

The standard classes of problems associated with the Gauss linear model 
are parameter estimation, testing hypotheses about parameters and predic- 
tion. In this paper we will be concerned only with prediction, mainly in the 
form of prediction intervals rather than point predictions. 

A major drawback of the Gauss linear model is that the corresponding 
prediction intervals are uninformative (i.e., coincide with the whole real line) 
unless the number of observations exceeds the number of parameters. The 
responses of a complex system cannot be realistically expected to be modeled 
using a small number of parameters, whereas the number of observations can 
be very limited. This motivates consideration of three other models in this 
paper, none of which requires that the number of observations should exceed 
the number of parameters. 

Perhaps the most important of these models is what we call the IID model: 
it is only assumed that the sequence of pairs (x n ,y n ) is IID. This model 
is nonparametric, effectively involving infinitely many parameters. Despite 
this, the model does allow one to obtain informative prediction intervals. 
The IID model, however, also has a fundamental limitation: informative
prediction intervals become possible only when the number of observations
reaches $1/\epsilon$, where $\epsilon$ is the chosen significance level.

Our third regression model combines the assumption (1) with the assumption
that $x_n$ are independent (between themselves and of $\xi_1, \xi_2, \ldots$)
and identically distributed Gaussian random vectors. We call it the MVA
model, with MVA referring to "multivariate analysis." It has also been widely 
discussed in the statistical literature; for example, Sampson's (1974) "two 
regressions" refers to the Gauss linear model and the MVA model. This 
model is narrower than both Gauss linear and IID models, and its strong 
assumptions ensure that informative prediction intervals can be produced 
almost right away. 

Finally, we consider the combination of the Gauss linear and IID models, 
which we call the IID-Gauss model: in addition to (1) we assume that the 
explanatory vectors $x_n$, $n = 1, 2, \ldots$, are random and IID (not necessarily
Gaussian, as in the MVA model) and that the sequence $\xi_1, \xi_2, \ldots$ is
independent of the explanatory vectors. This model, however, appears to be of
secondary importance. Empirically, it allows informative prediction
intervals at significance level $\epsilon$ soon after the number of observations exceeds the
minimum of $1/\epsilon$ and the number of parameters.

All the models considered in this paper are shown in Figure 1, with ar- 
rows leading from more general to more specific models. In this paper we 
begin (in Section 5) with the IID model. This model is popular in machine
learning, and it does not involve the often unrealistic
assumption that the noise variables $\xi_n$ are Gaussian or that the explanatory
vectors $x_n$ are Gaussian. An important advantage of the classical Gauss
linear model, considered in Section 6, is that the explanatory vectors are
not assumed to be IID (in other words, no "random design" is assumed).
This model is essentially equivalent to making no assumptions whatsoever
about the distribution of $x_n$ and assuming that the $\xi_n$ in (1) are IID and
distributed as $N(0, \sigma^2)$ conditional on $x_1, x_2, \ldots$. The Gauss linear model
(understood in this way) and the IID model are not comparable between
themselves, but both contain the other two models: the IID-Gauss model 
(Section 8), which is the intersection of the IID and Gauss linear models, 
and the MVA model (Section 7), which makes the further assumption that 
the explanatory vectors are Gaussian. 

Fisher (1973), Section IV.3, emphatically defended the use of the Gauss 
linear model even in the case where the distribution of the explanatory 
vectors is known (with or without parameters). There is also a view in 
the literature that the Gauss linear model and the MVA model are "essen- 
tially equivalent" [for a review of some results in this direction, see Sampson 
(1974)]. Our conclusion, however, is similar to Brown's (1990): when the 
MVA model is true, it can be far more useful for prediction; in particular, it 
can start giving informative prediction intervals long before the number of 
observations reaches the number of parameters $K$ (or the inverse significance
level $1/\epsilon$).

This paper uses a general method of prediction called conformal predic- 
tion. The method is reviewed in detail in the monograph by Vovk, Gam- 
merman and Shafer (2005) and introduced in the work leading up to that 
monograph. For each of the four models in Figure 1 we define a suitable con- 
fidence predictor, that is, a strategy for producing prediction intervals or, 
more generally, prediction regions. For the IID model we follow Vovk, Gam- 
merman and Shafer (2005) and for the Gauss linear model we use Fisher's 
classical confidence predictor. The confidence predictors for the MVA and 
IID-Gauss models are new. 

We are interested in two criteria of quality of confidence predictors, which
we call "validity" and "accuracy." For valid confidence predictors, the
probability of error equals the nominal significance level $\epsilon$ (or at least never
exceeds $\epsilon$, in which case we will refer to them as "conservatively valid," or
just "conservative," confidence predictors). The second criterion is applied
only to valid confidence predictors: we want the prediction intervals to be
as narrow as possible; in this paper we, somewhat arbitrarily, measure the
narrowness of a prediction interval $[a, b]$ by its length $b - a$. In particular,
we want the prediction intervals to become bounded as soon as possible.

Fig. 1. The four models considered in this paper (the three main models are given in
boldface).

Correspondingly, this paper uses two kinds of entities that one might want 
to call "models." The first kind is "hard models," such as the four models 
in Figure 1. These are the usual statistical models: our working hypothesis 
is that the data set was generated by one of the probability distributions in 
the model. In particular, the validity of our confidence predictors is allowed 
to depend on the hard model. By default, the word "model" means "hard 
model." 

In addition to the accepted hard model, one often has other a priori in- 
formation about the data-generating distribution: for example, only a few 
parameters might provide the bulk of the information relevant to prediction. 
Whereas we might hesitate to include such a priori information in the hard 
model explicitly, since it might destroy the validity of our confidence pre- 
dictor if this information happened to be far from the truth, we might still 
be able to use such information in designing accurate confidence predictors 
provided our model is flexible enough. A running example in this paper, 
introduced in Section 4, will be a linear system with 100 parameters ten of 
which are felt to be especially important. This will be our "soft model" (not 
defined formally); whether it is true or not affects only the accuracy, but 
not validity, of our confidence predictors. 

Separation of the available information about the data-generating distri- 
bution into the hard model and soft model increases robustness of confidence 
predictors with respect to modeling errors. If such an error occurs in the soft 
model, the validity of predictions is not affected. At worst the predictions 
will become useless, but they will not become misleading (with high prob- 
ability under any distribution in the hard model). For a further discussion 
and empirical study, see Gammerman and Vovk (2007), Section 4. 

The property of validity of conformal predictors can be stated in an es- 
pecially strong form in the on-line prediction protocol. It turns out that 
the true responses fall outside the corresponding prediction regions inde- 
pendently for different observations. In combination with the law of large 
numbers this implies that, with high probability, the frequency of error is 
approximately equal to the nominal significance level. Surprisingly, even for 
the classical prediction intervals in the Gauss linear model this property 
had been unknown prior to the work leading up to Vovk, Gammerman and 
Shafer (2005).

Two recent reviews of the theory of conformal prediction are Gammerman
and Vovk (2007) and Shafer and Vovk (2008). Parts of these papers are 
devoted to regression problems. 

Section 2 formally introduces the on-line prediction protocol, with a more 
detailed discussion postponed until Section 9. In Section 3 we describe the 
method of conformal prediction and state two key results (proved in the 
Appendix): one asserts the strong validity and the other universality of con- 
formal predictors. Section 4 describes an artificial data set used in later 
sections for illustrating the performance of various conformal predictors. 
The following Sections 5-8 apply the method of conformal prediction to the 
IID, Gauss linear, MVA and IID-Gauss models, in this order. Section 10 
concludes. 

2. On-line protocol, part I. In our prediction protocol, the task is to
sequentially predict $y_n$, $n = 1, 2, \ldots$, from $x_n$ and $(x_i, y_i)$, $i = 1, \ldots, n - 1$.
This on-line protocol is popular in machine learning [see, e.g.,
Cesa-Bianchi and Lugosi (2006) and references therein], but most statistical
research (except some work on sequential analysis) is still done in the
"off-line," or "batch," framework, where one starts from a complete sample
$(x_1, y_1), \ldots, (x_N, y_N)$. One of the few statisticians advocating the on-line
protocol (under the name "prequential," or predictive sequential) has been 
Dawid (1984). 

Weak and strong validity and median accuracy. To explain what pre- 
cisely we mean by validity and accuracy, the two criteria of predictive per- 
formance mentioned in Section 1, we will need the notation introduced in 
the following description of the on-line prediction protocol. 

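On-line prediction protocol

FOR n = 1, 2, ...:
  Predictor observes $x_n \in \mathbb{R}^K$;
  Predictor outputs $\Gamma_n^\epsilon \subseteq \mathbb{R}$ for all $\epsilon \in (0, 1)$;
  Predictor observes $y_n \in \mathbb{R}$;
  $\mathrm{err}_n^\epsilon := I_{y_n \notin \Gamma_n^\epsilon}$ for all $\epsilon \in (0, 1)$;
  $L_n^\epsilon := \sup \Gamma_n^\epsilon - \inf \Gamma_n^\epsilon$ for all $\epsilon \in (0, 1)$
END FOR.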

(As usual, $I_F$ is defined to be 1 if the condition $F$ holds and 0 if not.) At each
step and for each significance level $\epsilon$, Predictor outputs a prediction region
(usually, although not necessarily, an interval) $\Gamma_n^\epsilon \subseteq \mathbb{R}$. We require that,
for all $n$, the family of prediction regions should be nested:
$\Gamma_n^{\epsilon_1} \subseteq \Gamma_n^{\epsilon_2}$ whenever $\epsilon_1 \ge \epsilon_2$.
An error is registered, $\mathrm{err}_n^\epsilon = 1$, if the prediction region
fails to contain the true response $y_n$, and the accuracy of this particular
prediction is measured by the length $L_n^\epsilon$ of the corresponding "prediction
interval" $\mathrm{co}\,\Gamma_n^\epsilon$ ($\mathrm{co}\,E$ standing for the convex hull of the set $E$).

Let $\mathrm{Err}_n^\epsilon := \mathrm{err}_1^\epsilon + \cdots + \mathrm{err}_n^\epsilon$ be the cumulative number of errors made up
to, and including, step $n$. In the following sections, we will find it convenient
to distinguish between two notions of validity, "weak validity" and "strong 
validity." 
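
The protocol can be sketched in code. The following is only a minimal illustration under stated assumptions: each prediction region is taken to be a closed interval represented by a pair (lo, hi), a single significance level is tracked, and the names `run_online_protocol`, `predict` and `toy_predictor` are hypothetical names introduced here. The toy predictor is not valid in any of the models of this paper; it only exercises the bookkeeping of $\mathrm{Err}_n^\epsilon$ and $L_n^\epsilon$.

```python
import math
from typing import Callable, List, Sequence, Tuple

Interval = Tuple[float, float]  # assumed representation of a prediction region
History = List[Tuple[Sequence[float], float]]


def run_online_protocol(
    predict: Callable[[History, Sequence[float], float], Interval],
    xs: Sequence[Sequence[float]],
    ys: Sequence[float],
    eps: float,
) -> Tuple[List[int], List[float]]:
    """Run the on-line protocol at one significance level eps.

    Returns the cumulative error counts Err_n and the lengths L_n.
    """
    history: History = []
    cumulative_errors: List[int] = []
    lengths: List[float] = []
    errors_so_far = 0
    for x, y in zip(xs, ys):
        lo, hi = predict(history, x, eps)   # region is output before y is seen
        errors_so_far += 0 if lo <= y <= hi else 1
        cumulative_errors.append(errors_so_far)
        lengths.append(hi - lo)             # sup minus inf of the region
        history.append((x, y))              # y is revealed only now
    return cumulative_errors, lengths


def toy_predictor(history: History, x: Sequence[float], eps: float) -> Interval:
    """Hypothetical predictor (NOT valid in any model): range of past responses."""
    if len(history) < 2:
        return (-math.inf, math.inf)        # uninformative region
    past = [y for _, y in history]
    return (min(past), max(past))
```

Dividing the last cumulative error count by the number of steps gives the error frequency $\mathrm{Err}_n^\epsilon / n$ discussed in the text.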

Definition 1. A confidence predictor is defined to be a measurable prediction
strategy $\Gamma_n^\epsilon = \Gamma^\epsilon(x_1, y_1, \ldots, x_{n-1}, y_{n-1}, x_n)$ in the on-line prediction
protocol.

Definition 2. A confidence predictor is weakly valid in some statistical
model if the probability that $\mathrm{err}_n^\epsilon = 1$ is $\epsilon$, for each $\epsilon \in (0, 1)$ and each $n$
under any probability distribution in the model.

The definition of weak validity is standard [cf. Cox and Hinkley (1974),
(75) on page 243]. Weak validity by itself does not imply that $\mathrm{Err}_n^\epsilon / n$ is
likely to be close to $\epsilon$ for large $n$.

Definition 3. A confidence predictor is strongly valid if it is weakly
valid and, for each $\epsilon \in (0, 1)$, the events $\mathrm{err}_n^\epsilon = 1$, $n = 1, 2, \ldots$, are
independent.

Figure 3 below shows the plot of $\mathrm{Err}_n^\epsilon$ against $n$ for a specific confidence
predictor considered in this paper; it is typical of our predictors that the
slopes of the plots of $\mathrm{Err}_n^\epsilon$ are close to the corresponding significance levels $\epsilon$
(we use the significance levels 5%, 1% and 0.5% in all our figures, represented
by the corresponding confidence levels $1 - \epsilon$ in the legends). This is the only
figure in this paper illustrating the validity of our confidence predictors; such 
figures, in view of the mathematical results guaranteeing validity, tend to 
be uninformative. 

We will measure the accuracy of the predictions made for the first $n$
observations by the median $M_n^\epsilon$ of the sequence $L_1^\epsilon, \ldots, L_n^\epsilon$; again, this measure
is arbitrary, to a large degree. A plot of $M_n^\epsilon$ against $n$ will be called the
median-accuracy plot; examples of such plots are given in Figures 2 and
4-6.
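
As a small illustration of the median-accuracy measure, the running medians $M_n^\epsilon$ can be computed from the lengths $L_1^\epsilon, \ldots, L_n^\epsilon$ as sketched below. The function name `median_accuracy` is hypothetical, and the convention for even $n$ (the mean of the two middle values, as used by NumPy) is an assumption, since the text does not fix one.

```python
import numpy as np


def median_accuracy(lengths):
    """Return [M_1, ..., M_N], where M_n is the median of L_1, ..., L_n.

    Infinite lengths (uninformative intervals) are allowed.
    """
    lengths = np.asarray(lengths, dtype=float)
    return [float(np.median(lengths[:n])) for n in range(1, len(lengths) + 1)]
```

Plotting the returned list against $n$ gives a median-accuracy plot for one significance level.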

Unfortunately, the simple notions of validity introduced earlier have to 
be extended to become useful for our purpose. This is needed because, for 
example, the classical prediction intervals are uninformative before the number
of observations reaches the number of parameters, and so for small $n$
the error probability is zero rather than $\epsilon$. Let $N$ be a set of positive
integer numbers (we are mainly interested in the case where $N$ has the form
$\{m, m + 1, \ldots\}$).

Definition 4. We say that a confidence predictor is weakly valid for
$n \in N$ in a statistical model if the probability is $\epsilon$ that it makes an error,
$\mathrm{err}_n^\epsilon = 1$, at step $n$ under any probability distribution in the model and for
all $n \in N$ and $\epsilon \in (0, 1)$. It is strongly valid for $n \in N$ if, in addition, $\mathrm{err}_n^\epsilon$,
$n \in N$, are independent for any fixed $\epsilon$.

The role of the on-line protocol. The exposition of this paper is based 
on the on-line protocol, but the majority of our findings are not constrained 
to this specific protocol. For example, the fact that valid and informative 
prediction intervals can become feasible in the MVA model before the num- 
ber of observations exceeds the number of parameters does not depend on 
the prediction protocol. In the absence of the on-line protocol, however, 
"validity" should be understood in the standard sense of weak validity. 

3. Conformal prediction. In this section we define a class of confidence 
predictors, called conformal predictors, and state results about their validity 
and universality, in a certain sense. 

Notions of sufficiency. Fix some observation space $Z$. We will be interested
in the space $Z = \mathbb{R}^K \times \mathbb{R}$ of pairs $(x, y)$; in general, $Z$ is a measurable
space assumed to be Luzin, to ensure the existence of regular conditional
probabilities. To define conformal predictors, we will need not only a
statistical model on $Z^\infty$ but also a sequence of sufficient statistics
$S_n : Z^n \to \Sigma_n$, $n = 1, 2, \ldots$; we will always assume that $\Sigma_n = S_n(Z^n)$. We will need a
strengthened form of sufficiency; in our definitions we mainly follow
Lauritzen (1988), Section II.2.

The sequence $(S_n)$ is algebraically transitive if there exists a sequence of
measurable functions $F_n : \Sigma_{n-1} \times Z \to \Sigma_n$, $n = 2, 3, \ldots$, such that

$S_n(\zeta_1, \ldots, \zeta_{n-1}, \zeta_n) = F_n(S_{n-1}(\zeta_1, \ldots, \zeta_{n-1}), \zeta_n)$

for all $(\zeta_1, \ldots, \zeta_{n-1}, \zeta_n) \in Z^n$. Intuitively, $S_n(\zeta_1, \ldots, \zeta_n)$ is the summary of
the first $n$ observations, and the condition of algebraic transitivity means
that the summary can be updated on-line.

The sequence $(S_n)$ is totally sufficient for a statistical model $\mathcal{P}$ on $Z^\infty$ if,
for each $n = 1, 2, \ldots$:

• $S_n$ is sufficient for $\mathcal{P}$;

• $\zeta_1, \ldots, \zeta_n$ and $\zeta_{n+1}, \zeta_{n+2}, \ldots$ are conditionally independent given
$S_n(\zeta_1, \ldots, \zeta_n)$, where $(\zeta_1, \zeta_2, \ldots) \sim P$, for any $P \in \mathcal{P}$.

The second condition ensures that $S_n(\zeta_1, \ldots, \zeta_n)$ carries all information in
$\zeta_1, \ldots, \zeta_n$ that can be used for predicting the future observations $\zeta_{n+1}, \zeta_{n+2}, \ldots$.

A sequence of statistics that is both algebraically transitive and totally
sufficient will be called an ATTS sequence. In the rest of this paper we will
often say "model" to mean a statistical model $\mathcal{P}$ equipped with an ATTS
sequence $(S_n)$. This makes the word "model" ambiguous as we often omit
"statistical" in "statistical model," but this should not lead to
misunderstandings.

Each of the four statistical models considered in this paper (see Figure 1)
will be complemented with an ATTS sequence; in all four cases the
observation space $Z$ will be $\mathbb{R}^K \times \mathbb{R}$.

Testing conformity. The main ingredient of conformal prediction is
statistical testing of conformity of a new observation $\zeta_n$ to the old observations
$\zeta_1, \ldots, \zeta_{n-1}$. In general, our statistical tests will be randomized.

Fix a statistical model $\mathcal{P}$ with an ATTS sequence $S_n : Z^n \to \Sigma_n$. Define
$\Sigma_0$ to be a fixed one-element set. Any sequence of measurable functions
$A_n : \Sigma_{n-1} \times Z \to \mathbb{R}$, $n = 1, 2, \ldots$, is called a nonconformity measure; $A_n$ will
be our test statistics. Given a nonconformity measure $(A_n)$, for each sequence
$\zeta_1, \zeta_2, \ldots$ of observations and each sequence $(\tau_1, \tau_2, \ldots) \in [0, 1]^\infty$ we define the
p-values

(2) $p_n = p_n(\zeta_1, \ldots, \zeta_n, \tau_n) := P(A_n^{\mathrm{rnd}} > A_n^{\mathrm{obs}} \mid S_n^{\mathrm{rnd}} = S_n^{\mathrm{obs}}) + \tau_n P(A_n^{\mathrm{rnd}} = A_n^{\mathrm{obs}} \mid S_n^{\mathrm{rnd}} = S_n^{\mathrm{obs}})$,

$n = 1, 2, \ldots$,

where $A_n^{\mathrm{rnd}} := A_n(S_{n-1}(\xi_1, \ldots, \xi_{n-1}), \xi_n)$ and $S_n^{\mathrm{rnd}} := S_n(\xi_1, \ldots, \xi_n)$ are the
"random" values, $A_n^{\mathrm{obs}} := A_n(S_{n-1}(\zeta_1, \ldots, \zeta_{n-1}), \zeta_n)$ and $S_n^{\mathrm{obs}} := S_n(\zeta_1, \ldots, \zeta_n)$
are the "observed" values, and the probabilities are taken with respect to
$(\xi_1, \xi_2, \ldots) \sim P$ for some $P \in \mathcal{P}$. Since $S_n$ are sufficient statistics, $p_n$ do not
depend on $P \in \mathcal{P}$ (at least for a suitable choice of regular conditional
probabilities). We will be interested in two cases: deterministic, where $\tau_n = 1$
for all $n$, and randomized, where $\tau_1, \tau_2, \ldots$ are generated independently from
the uniform distribution $U$ on $[0, 1]$ (such $\tau_1, \tau_2, \ldots$ model the output of a
random number generator).

Theorem 1. Suppose that the sequence of observations $(\zeta_1, \zeta_2, \ldots) \in Z^\infty$
is generated from a probability distribution $P \in \mathcal{P}$ and that the random
numbers $(\tau_1, \tau_2, \ldots) \sim U^\infty$ are independent of the observations. The p-values (2)
are then independent and distributed uniformly on $[0, 1]$:

$(p_1, p_2, \ldots) \sim U^\infty$.

For a proof of this theorem, see the Appendix. The fact that $p_n \sim U$ is well
known, at least in the continuous case [see, e.g., Cox and Hinkley (1974),
page 66; (2) is a version of Cox and Hinkley's (1)].

Conformal prediction. We start by extending, and spelling out in greater
detail, the notion of a confidence predictor: in the general theory of this
section and in its application to the IID model in Section 5 we will need an
element (typically quite small) of randomization in confidence predictors.

Definition 5. A randomized confidence predictor is a measurable function
which maps every significance level $\epsilon \in (0, 1)$, every data sequence
$x_1, y_1, \ldots, x_{n-1}, y_{n-1}$, every vector $x_n$ of explanatory variables, and every
number $\tau \in [0, 1]$ to a set $\Gamma_n^\epsilon = \Gamma^\epsilon(x_1, y_1, \ldots, x_{n-1}, y_{n-1}, x_n, \tau) \subseteq \mathbb{R}$. We will
use the notation $\Gamma_n^\epsilon$ when the data sequence, the vector of explanatory
variables, and the number $\tau$ are clear from the context.

Let the observation space be $Z = \mathbb{R}^K \times \mathbb{R}$. Once the p-values (2) are
defined, we can use them for confidence prediction [this is a standard
procedure; cf. Cox and Hinkley (1974), (76) on page 243]: we set

(3) $\Gamma^\epsilon(x_1, y_1, \ldots, x_{n-1}, y_{n-1}, x_n, \tau_n) := \{y \in \mathbb{R} : p_n((x_1, y_1), \ldots, (x_{n-1}, y_{n-1}), (x_n, y), \tau_n) > \epsilon\}$.

Definition 6. The randomized confidence predictor defined by (3) is
called the smoothed conformal predictor determined by the nonconformity
measure $(A_n)$. A smoothed conformal predictor is a smoothed conformal
predictor determined by some nonconformity measure.

The following statement immediately follows from Theorem 1 and asserts 
that smoothed conformal predictors are strongly valid. 

Corollary 1. If the sequence of observations $(x_n, y_n)$, $n = 1, 2, \ldots$,
is generated by a probability distribution $P \in \mathcal{P}$ and a smoothed conformal
predictor is fed with random numbers $(\tau_1, \tau_2, \ldots) \sim U^\infty$ independent of the
observations, the error sequence $\mathrm{err}_1^\epsilon, \mathrm{err}_2^\epsilon, \ldots$ at any significance level $\epsilon$ is a
sequence of IID Bernoulli random variables with parameter $\epsilon$.

The adjective "smoothed" refers to using random numbers; if we take
$\tau_n = 1$ for all $n = 1, 2, \ldots$, we will obtain the definition of a "deterministic
conformal predictor," or just "conformal predictor," and in this case we omit
$\tau_n$ from our notation.

Definition 7. A conformal predictor is the confidence predictor defined by

$\Gamma^\epsilon(x_1, y_1, \ldots, x_{n-1}, y_{n-1}, x_n) := \{y \in \mathbb{R} : p_n((x_1, y_1), \ldots, (x_{n-1}, y_{n-1}), (x_n, y), 1) > \epsilon\}$,

where the p-values $p_n$ are defined by (2).

Notice that when a conformal predictor makes an error, the corresponding
smoothed conformal predictor also makes an error. In combination with
Corollary 1, we can see that conformal predictors are conservative, in the
sense that, for each $\epsilon$, their error sequence $\mathrm{err}_1^\epsilon, \mathrm{err}_2^\epsilon, \ldots$ is dominated by a
sequence of IID Bernoulli random variables with parameter $\epsilon$. In particular,
whereas we have $\lim_{n \to \infty} (\mathrm{Err}_n^\epsilon / n) = \epsilon$ a.s. for smoothed conformal
predictors, we only have $\limsup_{n \to \infty} (\mathrm{Err}_n^\epsilon / n) \le \epsilon$ a.s. for conformal predictors.
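
The difference between smoothed and deterministic conformal predictors can be illustrated by a small simulation. This is a sketch under assumptions that are not part of the paper: the observations are IID real numbers, the nonconformity score of an observation is the observation itself (a hypothetical choice made only for brevity), and the p-values are computed by the rank formula that (2) reduces to for IID observations [formula (7) in Section 5]. An error at step $n$ is the event $p_n \le \epsilon$.

```python
import numpy as np


def error_indicators(values, taus, eps):
    """Return err_1, ..., err_N, where err_n = 1 iff p_n <= eps.

    Hypothetical nonconformity score: the observation itself, so that
    p_n = (#{i <= n: v_i > v_n} + tau_n * #{i <= n: v_i = v_n}) / n.
    """
    values = np.asarray(values, dtype=float)
    errs = np.zeros(len(values), dtype=int)
    for n in range(1, len(values) + 1):
        seen, last = values[:n], values[n - 1]
        p = (np.sum(seen > last) + taus[n - 1] * np.sum(seen == last)) / n
        errs[n - 1] = int(p <= eps)
    return errs


rng = np.random.default_rng(1)
N, eps = 2000, 0.1
values = rng.standard_normal(N)                # IID observations
smoothed = error_indicators(values, rng.uniform(size=N), eps)
deterministic = error_indicators(values, np.ones(N), eps)
# Error frequencies Err_N / N for the two predictors.
print(smoothed.mean(), deterministic.mean())
```

Whenever the deterministic version makes an error the smoothed version does too, so the deterministic error frequency can never exceed the smoothed one on the same data.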

We will see that there is no difference between conformal predictors and
the corresponding smoothed conformal predictors for the Gauss linear model
and $n \ge K + 3$ since the second addend on the right-hand side of (2) is then
zero. There is also no difference for the MVA model and $n \ge 3$; however,
the difference is important (although usually barely noticeable on error and
accuracy plots) for the IID model.

A natural question is whether there are other ways to achieve validity, ex- 
cept conformal prediction. The following theorem will give a negative answer 
to a version of this question. 

Definition 8. A confidence predictor $\Gamma$ is invariant if $\Gamma_n^\epsilon$, $n \ge 1$,
depends on the first $n - 1$ observations only through the value of $S_{n-1}$ on those
observations.

The use of invariant confidence predictors is natural in view of the suffi- 
ciency principle; see, for example, Cox and Hinkley (1974), Section 2.3(h). 
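Let $N$ be a set of positive integers. We say that a confidence predictor $\Gamma'$ is
at least as accurate as another confidence predictor $\Gamma$ for $n \in N$ if

$\Gamma'^{\epsilon}(x_1, y_1, \ldots, x_{n-1}, y_{n-1}, x_n) \subseteq \Gamma^{\epsilon}(x_1, y_1, \ldots, x_{n-1}, y_{n-1}, x_n)$

for all $\epsilon$, all $n \in N$, and $P$-almost all $x_1, y_1, \ldots, x_{n-1}, y_{n-1}, x_n$, under any
probability distribution $P \in \mathcal{P}$.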

Recall that a statistic $S$ taking values in a measurable space $\Sigma$ is said to
be boundedly complete (with respect to the statistical model $\mathcal{P}$) if, for any
bounded measurable function $f : \Sigma \to \mathbb{R}$, the following condition is satisfied:
the expected value $E_P(f(S))$ of $f(S)$ is zero under all $P \in \mathcal{P}$ only if $f(S) = 0$
$P$-almost surely for all $P \in \mathcal{P}$.

Theorem 2. Let $N$ be a set of positive integers. Suppose the ATTS
statistics $S_n$ are boundedly complete for $n \in N$. If a confidence predictor $\Gamma$
is invariant and weakly valid for $n \in N$, then there is a conformal predictor
that is at least as accurate as $\Gamma$ for $n \in N$.

This theorem is also proved in the Appendix. An important step toward
its proof was made by Takeuchi (1975), page 31.

Table 1
Steps at which informative prediction becomes possible for the four models;
$\epsilon$ is the significance level ($\epsilon < 1/2$ is assumed) and $K$ is the number of parameters

Model                 The first step at which prediction intervals can become informative
IID model             $\lceil 1/\epsilon \rceil$
Gauss linear model    $K + 3$
MVA model             $3$
IID-Gauss model       $\min(\lceil 1/\epsilon \rceil, K + 3)$
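
The entries of Table 1 are simple enough to evaluate directly. The sketch below is only a convenience for reading the table; the function name `first_informative_step` and the model labels used as dictionary keys are hypothetical. The significance level is converted to an exact fraction before taking the ceiling, to avoid floating-point surprises in $\lceil 1/\epsilon \rceil$.

```python
import math
from fractions import Fraction


def first_informative_step(model: str, eps: float, K: int) -> int:
    """First step at which prediction intervals can become informative (Table 1)."""
    # Exact ceiling of 1/eps; eps is assumed to be a "round" level such as 0.05.
    inv_eps = math.ceil(1 / Fraction(eps).limit_denominator(10**6))
    steps = {
        "iid": inv_eps,
        "gauss-linear": K + 3,
        "mva": 3,
        "iid-gauss": min(inv_eps, K + 3),
    }
    return steps[model]
```

For instance, with $K = 100$ and $\epsilon = 0.5\%$ the IID model gives step 200, whereas the IID-Gauss model gives $\min(200, 103) = 103$.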



The condition of bounded completeness holds for the Gauss linear model 
and the MVA model by the standard completeness result for exponential 
statistical models [see, e.g., Theorem 4.1 in Lehmann (1986)], and it is also 
known to hold for the IID model [see the theorem on page 797 in Bell, 
Blackwell and Breiman (1960)]. 

4. Data set. We will illustrate the accuracy of various confidence
predictors using the following artificially generated data set with 600
observations and $K = 100$ explanatory variables. The components $x_{n,k}$ of $x_n$ are
independently generated from $N(0, 1)$, and the responses $y_n$ are generated
according to (1) with $\xi_n \sim N(0, 1)$ independent between themselves and of
all $x_{n,k}$, with $\alpha = 100$ and with the following components $\beta_k$ of $\beta$:

$\beta_k := \begin{cases} \ldots, & k = 1, \ldots, 10, \\ \ldots, & k = 11, \ldots, 100. \end{cases}$

The probability distribution generating this data set belongs to all four
models considered in this paper (Figure 1). It is natural to expect that more
specific models, when true, will lead to better predictions. In one respect this
is true: more general models allow informative predictions later, as shown in
Table 1 (to be explained in later sections). However, soon after the threshold
given in the table is reached, the quality of prediction becomes very similar
on our data set.

The (informal) soft model guiding the choice of the nonconformity
measure will include the assumption of linearity (1) and the knowledge, or guess,
that the first 10 explanatory variables are much more important than the
rest.

Relationship (1) between the response and explanatory variables can be
written as

(4) $y_n = \gamma \cdot z_n + \xi_n$,

where

$\gamma := \begin{pmatrix} \alpha \\ \beta \end{pmatrix} \in \mathbb{R}^{K+1}$ and $z_n := \begin{pmatrix} 1 \\ x_n \end{pmatrix} \in \mathbb{R}^{K+1}$.

For $l = 1, 2, \ldots$, let $Z_l$ be the $l \times (K + 1)$ matrix whose rows are $z_i'$, $i = 1, \ldots, l$,
and $\mathbf{y}_l$ be the vector whose $i$th element is $y_i$, $i = 1, \ldots, l$. We will sometimes
refer to the first column of $Z_l$ as the dummy column.
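
The data-generating recipe of this section can be sketched as follows. This is an illustration only: the function name `generate_data` and the seed are hypothetical, and the coefficient vector $\beta$ is left as an argument to be supplied by the caller rather than hard-coded. The sketch also builds the matrix $Z$ whose first column is the dummy column of ones.

```python
import numpy as np


def generate_data(beta, alpha=100.0, n_obs=600, seed=0):
    """Generate (X, y, Z) following model (1) with standard Gaussian x and noise.

    beta  : coefficient vector of length K, supplied by the caller
    alpha : intercept (100 in the text)
    Z     : matrix with rows z_n' = (1, x_n'), i.e. X with a dummy column of ones
    """
    rng = np.random.default_rng(seed)
    beta = np.asarray(beta, dtype=float)
    X = rng.standard_normal((n_obs, beta.size))   # x_{n,k} ~ N(0, 1), independent
    noise = rng.standard_normal(n_obs)            # xi_n ~ N(0, 1), independent of X
    y = alpha + X @ beta + noise                  # relationship (1)
    Z = np.column_stack([np.ones(n_obs), X])      # dummy column first
    return X, y, Z
```

With $\gamma$ obtained by stacking $\alpha$ on top of $\beta$, the products of the rows of `Z` with $\gamma$ reproduce the noise-free part of the responses, which is relationship (4).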

5. The IID model. The statistical model considered in this section is
nonparametric: we simply assume that the observations $(x_n, y_n)$ are IID.
Notice that this does not involve the assumption of linearity of the "true"
regression function or the assumption of a Gaussian noise. Linearity is,
however, an important component of the soft model used for choosing a suitable
nonconformity measure.

The ATTS statistics are

$S_n := \lbag (x_1, y_1), \ldots, (x_n, y_n) \rbag$,

where we use $\lbag a_1, \ldots, a_n \rbag$ to denote the bag, or multiset, consisting of $a_1, \ldots, a_n$
(some of these elements may coincide). For each $n$, the conditional
distribution of $(\xi_1, \ldots, \xi_n)$ given that

$S_n(\xi_1, \ldots, \xi_n) = \lbag (x_1, y_1), \ldots, (x_n, y_n) \rbag$,

where $\xi_1, \xi_2, \ldots$ are IID random elements taking values in $\mathbb{R}^K \times \mathbb{R}$, assigns (with
probability one) the same probability, $1/n!$, to every ordering
$((x_{\pi(1)}, y_{\pi(1)}), \ldots, (x_{\pi(n)}, y_{\pi(n)}))$ of the bag $\lbag (x_1, y_1), \ldots, (x_n, y_n) \rbag$.

The IID model is typical in that there is a great flexibility in choosing
a nonconformity measure for use in conformal prediction. Suppose, for
example, that the number of explanatory variables $K$ is too large for us to
estimate all the $\beta_k$ and $\alpha$ in the soft model (1). We believe, however, that
the first $K^{(1)} \ll K$ of the explanatory variables are especially important, and
it is feasible to estimate the corresponding $\beta_k$, $k = 1, \ldots, K^{(1)}$, and $\alpha$.

Fix temporarily a positive integer number $n$. We will write $\mathbf{y}$ for $\mathbf{y}_n$, $Z$
for $Z_n$ and $K^{(1)}$ for $K^{(1)}_n$. Let $U$ be the submatrix of $Z$ consisting of the first
$K^{(1)} + 1$ columns of $Z$: those that correspond to the explanatory variables
deemed to be useful at this stage plus the dummy column 1. To test the
conformity of the $n$th observation to the first $n - 1$ observations, we will
first fit a hyperplane to all $n$ observations using the relevant explanatory
variables. Applying a small "ridge coefficient" $a > 0$ to avoid the need to
invert singular matrices, we obtain the vector of residuals

(5) $e := \mathbf{y} - U(U'U + aI)^{-1}U'\mathbf{y}$,

whose components will be denoted $e_1, \ldots, e_n$.

We will be interested in the conformal predictor determined by the
nonconformity measure

(6) $A_n(S_{n-1}(x_1, y_1, \ldots, x_{n-1}, y_{n-1}), (x_n, y_n)) := |e_n|$.

Deleted and, especially, studentized residuals would also be a natural choice
[see, e.g., Vovk, Gammerman and Shafer (2005), pages 34-35]. In our
experience, however, the difference is not significant, and we stick to the simplest
choice. The confidence predictor obtained from this conformal predictor by
replacing the prediction regions $\Gamma_n^\epsilon$ with the prediction intervals $\mathrm{co}\,\Gamma_n^\epsilon$ will
be called the IID predictor (cf. the comments at the end of this section).
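
The residual vector (5) and the nonconformity score (6) can be computed as in the following sketch. The function names `ridge_residuals` and `nonconformity_score` and the argument name `k_useful` (the number of explanatory variables deemed useful) are hypothetical; the linear system is solved directly instead of forming the matrix inverse, which is a numerical choice made here and not something the text prescribes.

```python
import numpy as np


def ridge_residuals(Z, y, k_useful, a=0.01):
    """Residuals e = y - U (U'U + aI)^{-1} U'y, as in (5).

    Z        : n x (K+1) matrix whose first column is the dummy column of ones
    y        : vector of the n responses
    k_useful : number of explanatory variables kept; U is the first k_useful + 1 columns
    a        : ridge coefficient, assumed positive
    """
    U = Z[:, : k_useful + 1]
    coef = np.linalg.solve(U.T @ U + a * np.eye(U.shape[1]), U.T @ y)
    return y - U @ coef


def nonconformity_score(Z, y, k_useful, a=0.01):
    """Nonconformity score (6): absolute residual of the last observation."""
    return abs(ridge_residuals(Z, y, k_useful, a)[-1])
```

Because the residuals are a fixed matrix applied to the response vector, they depend linearly on the responses; this is what makes the prediction set computable by solving linear equations.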

The IID predictor can be implemented fairly efficiently. First notice that
for the IID model the formula (2) for p-values can be simplified to

(7) $p_n = \frac{|\{i : \alpha_i > \alpha_n\}| + \tau_n |\{i : \alpha_i = \alpha_n\}|}{n}$,

where $\alpha_i := A_n(\lbag \zeta_1, \ldots, \zeta_{i-1}, \zeta_{i+1}, \ldots, \zeta_n \rbag, \zeta_i)$, $i$ ranges over $\{1, \ldots, n\}$, and
$|E|$ stands for the size of the set $E$. In the case of the nonconformity measure
(6), $\alpha_i = |e_i|$. The residuals (5) can be written in the form

e = y - U(U'U + aiy'U'y = Cy, 

where C is the matrix I — XJ (XJ'XJ + ai)" 1 ^' , not depending on the response 
variables. If we fix the first n — 1 response variables yi and vary the last one, 
y, the residuals Ci = &i(y), i = 1, ■ • ■ ,n, become linear functions of y (this 
fact will also be used in Section 7). By (7) with r n := 1, the p- value is the 
fraction of i = 1, .. . ,n satisfying |ej(y)| > |e n (y)|; therefore, as y varies from 
— oo to oo, the p- value can change only at the at most 2n — 2 points (called 
critical points) which are solutions to the linear equations ej(y) = e n (y) and 
e i(y) = ~ e n(y)- This divides the real line into at most 4n — 3 intervals: the 
critical points, considered as degenerate closed intervals, the open intervals 
bounded on both sides by adjacent critical points, and the two unbounded 
open intervals to the left of the leftmost critical point and to the right of the 
rightmost critical point; if there are no critical points, this collapses into one 
unbounded open interval E. We can compute the p-value for one point in 
each of these intervals and then compute as the union of the intervals with 
p- values exceeding e. The computation of the IID prediction interval coT^ 
can be simplified if we notice that the set is closed (which is opposite 
to what we will have for the Gauss linear and MVA models): assuming that 
the set of critical points is nonempty, coT^ is bounded if and only if the two 
unbounded intervals have p- values at most e, in which case the end-points of 
co can be found as the leftmost and rightmost critical points with p- values 
exceeding e. Computing T e n and coT^ from scratch (e.g., without using the 
results of computations from the previous steps of the on-line protocol) takes 
time 0{n log n) [see Vovk, Gammerman and Shafer (2005), page 33]. 
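The critical-point procedure just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: the function name `iid_prediction_interval`, its argument layout (the rows of `U` cover all n objects, the last row being the new one) and the small numerical tolerance in the p-value are choices made here, and the sketch evaluates the p-value naively at every critical point, so it runs in quadratic rather than O(n log n) time.

```python
import numpy as np

def iid_prediction_interval(U, y_prev, eps, a=0.01):
    """Convex hull of the conformal region for nonconformity measure (6), tau_n = 1.

    U: (n, d) array, one row per object, the last row being the new object.
    y_prev: the n-1 known responses. eps: significance level. a: ridge coefficient.
    Returns (lo, hi), where either end may be infinite, or None for an empty region.
    """
    U = np.asarray(U, dtype=float)
    y_prev = np.asarray(y_prev, dtype=float)
    n, d = U.shape
    # C = I - U (U'U + aI)^{-1} U' does not depend on the responses.
    C = np.eye(n) - U @ np.linalg.solve(U.T @ U + a * np.eye(d), U.T)
    A = C[:, :-1] @ y_prev          # constant parts of e_i(y)
    B = C[:, -1]                    # slopes of e_i(y) in the candidate response y

    def p_value(y):
        e = np.abs(A + B * y)
        tol = 1e-9 * (1.0 + e[-1])  # guard against rounding at critical points
        return np.count_nonzero(e >= e[-1] - tol) / n

    # Critical points: solutions of e_i(y) = e_n(y) and e_i(y) = -e_n(y).
    crit = []
    for i in range(n - 1):
        for sign in (1.0, -1.0):
            slope = B[i] - sign * B[-1]
            if abs(slope) > 1e-12:
                crit.append((sign * A[-1] - A[i]) / slope)
    crit.sort()
    if not crit:
        return (-np.inf, np.inf) if p_value(0.0) > eps else None

    inside = [c for c in crit if p_value(c) > eps]
    if not inside:                  # the region is closed, so it must be empty
        return None
    left_ray = p_value(crit[0] - 1.0) > eps
    right_ray = p_value(crit[-1] + 1.0) > eps
    lo = -np.inf if left_ray else inside[0]
    hi = np.inf if right_ray else inside[-1]
    return lo, hi
```

With only a dummy column in `U` and three objects, any significance level below 1/3 gives the whole real line, since the p-value is never below 1/n.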
For use in our experiments with the artificial data set described in Section
4, we take

$$K'_n := \begin{cases} 10, & \text{if } n < 103, \\ 100, & \text{otherwise,} \end{cases} \tag{8}$$

and so define $U$ as the first 11 columns of $Z$ if $n < 103$ and as the full
$Z$ otherwise. Our chosen value for the threshold, 103, appeared to us slightly
less arbitrary than other choices, since it is the first step when the
classical prediction intervals [see (10)] become bounded. However, the quality
of the estimates of $\alpha$ and the 100 components of $\beta$ is still poor
when $n$ is close to 103. This affects the quality of our prediction intervals
but does not show on the median-accuracy plots. The value of the ridge
coefficient is always $a = 0.01$.

As Figure 2 shows, the IID predictor works well for our data set if the
significance level is not too demanding: it can be seen from (7) (with
$\tau_n := 1$) that for the IID prediction interval
$\operatorname{co}\Gamma^\epsilon_n$ to be bounded the number of observations
$n$ has to be at least $1/\epsilon$ (as Table 1 says). For example, for the
significance level $\epsilon = 0.5\%$, the IID predictor requires 200
observations to produce bounded predictions, and this shows on the
median-accuracy plot at $n = 399$ (since for $n < 399$ at least half of the
observed prediction intervals are infinitely wide).
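The arithmetic behind the figures 200 and 399 can be checked with a short sketch. The helper names below are hypothetical, and the second function assumes that every interval from the first bounded step onward is bounded, which is an idealization of what the plots show.

```python
import math
from fractions import Fraction

def first_bounded_step(eps):
    """Smallest n with 1/n <= eps, i.e. ceil(1/eps); eps is given as a decimal string."""
    return math.ceil(1 / Fraction(eps))

def first_median_bounded_step(first_bounded):
    """First n at which more than half of the intervals observed so far are bounded,
    assuming all intervals from step `first_bounded` onward are bounded."""
    return 2 * first_bounded - 1

for eps in ("0.05", "0.01", "0.005"):
    m = first_bounded_step(eps)
    print(eps, m, first_median_bounded_step(m))
```

Exact rational arithmetic is used so that the ceiling is not affected by floating-point rounding of 1/eps.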

The IID model is nonparametric but we can see that it still admits valid
confidence predictors (or conservative confidence predictors if one insists
on using deterministic predictors). The threshold $1/\epsilon$ can be said to
play the role of the number of parameters, and the nonparametric nature of the
model is reflected in the fact that $1/\epsilon \to \infty$ as
$\epsilon \to 0$. Since $1/\epsilon$ tends to $\infty$ relatively slowly, such
an infinite-dimensional model may be better for the purpose of prediction than
a $K$-dimensional model with a very large $K$.




Fig. 2. The median-accuracy plot for the IID predictor. The three significance
levels used in this and all the following figures are
$\epsilon = 0.05, 0.01, 0.005$, shown in the form $100(1 - \epsilon)\%$ (the
corresponding confidence levels) in the legends.

Fig. 3. The cumulative numbers of errors made by the IID predictor:
$\mathrm{Err}^\epsilon_n$ is plotted against $n$.

Theorem 2 is not directly applicable to the IID model, since only smoothed 
conformal predictors are valid, as the latter term is used in this paper. Vovk, 
Gammerman and Shafer (2005), Section 2.4, state two results of the same 
nature about the IID model. 

There are two sources of conservativeness for the IID predictor as described
above (and used for producing Figure 2). First, we used a deterministic
predictor (taking $\tau_n = 1$ for all $n$), and second, we replaced each
prediction region by its convex hull. Our experiments (see, e.g., Figure 3)
show that we still have approximate validity.

For each model considered in this paper except the Gauss linear model 
we define a nonconformity measure involving the matrix U defined earlier 
in this section. In the case of the IID model, we have used the nonconformity
measure (6) and called the corresponding conformal predictor with
$\Gamma^\epsilon_n$ replaced by $\operatorname{co}\Gamma^\epsilon_n$ the IID
predictor [it was called "Ridge Regression Confidence Machine" in Vovk,
Gammerman and Shafer (2005)]. Of course, our
brief term is somewhat misleading: it should always be borne in mind that 
the conformal predictor leading to the IID predictor is only one of many 
conformal predictors that can be defined in the IID model. Similarly, in the 
following three sections we will introduce the Gauss predictor, the MVA pre- 
dictor and the IID-Gauss predictor, which will also correspond to specific 
nonconformity measures. 

6. The Gauss linear model. Let $\hat\gamma_l := (Z_l'Z_l)^{-1}Z_l'y_l$ be the
least-squares estimate of the parameter vector $\gamma$ in (4) from the first
$l$ observations. For simplicity, we will assume that the matrix $Z_l$ has
full rank [i.e., $\operatorname{rank} Z_l = \min(l, K + 1)$] for all $l$; this
implies that $\hat\gamma_l$ is well defined for $l \ge K + 1$.

Let $\hat y_n$ be the least-squares prediction $\hat\gamma_{n-1} \cdot z_n$
for $y_n$ and

$$\hat\sigma_l^2 := \frac{1}{l - K - 1} (y_l - Z_l\hat\gamma_l)'(y_l - Z_l\hat\gamma_l)$$

be the standard estimate of $\sigma^2$ from $Z_l$ and $y_l$. It is well known
that in the Gauss linear model the ratio

$$T_n := \frac{y_n - \hat y_n}{\sqrt{1 + z_n'(Z_{n-1}'Z_{n-1})^{-1}z_n}\,\hat\sigma_{n-1}}, \qquad n = K + 3, K + 4, \dots, \tag{9}$$

has the $t$-distribution with $n - K - 2$ degrees of freedom. This gives the
classical weakly valid prediction interval for the $n$th response,

$$\Gamma^\epsilon_n := \left\{ y \in \mathbb{R} : |y - \hat y_n| < t^{\epsilon/2}_{n-K-2} \sqrt{1 + z_n'(Z_{n-1}'Z_{n-1})^{-1}z_n}\,\hat\sigma_{n-1} \right\}, \qquad n \ge K + 3, \tag{10}$$

where $t^\delta_m$ is the upper $\delta$ point of the $t$-distribution with $m$
degrees of freedom. [See, e.g., Seber and Lee (2003), (5.27).] We set
$\Gamma^\epsilon_n$ to $\mathbb{R}$ when $n < K + 3$.
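As an illustration of (10), here is a minimal numerical sketch. The function name and argument layout are assumptions made for this example; in particular, the critical value `t_crit` (the upper $\epsilon/2$ point of the $t$-distribution with $n - K - 2$ degrees of freedom) is supplied by the caller, for instance from statistical tables, so that the sketch depends only on NumPy.

```python
import numpy as np

def gauss_prediction_interval(Z_prev, y_prev, z_new, t_crit):
    """End-points of the open interval (10).

    Z_prev: (n-1) x (K+1) matrix of the previous z_i (dummy 1 plus x_i).
    y_prev: the n-1 previous responses. z_new: the vector z_n.
    t_crit: upper eps/2 point of the t-distribution with n-K-2 degrees of freedom.
    """
    Z_prev = np.asarray(Z_prev, dtype=float)
    y_prev = np.asarray(y_prev, dtype=float)
    z_new = np.asarray(z_new, dtype=float)
    m, k1 = Z_prev.shape                      # m = n - 1 rows, k1 = K + 1 columns
    dof = m - k1                              # n - K - 2 degrees of freedom
    if dof < 1:                               # n < K + 3: the whole real line
        return (-np.inf, np.inf)
    G = np.linalg.inv(Z_prev.T @ Z_prev)      # (Z'Z)^{-1}; full rank is assumed
    gamma_hat = G @ Z_prev.T @ y_prev         # least-squares estimate
    resid = y_prev - Z_prev @ gamma_hat
    sigma_hat = np.sqrt(resid @ resid / dof)  # standard estimate of sigma
    y_hat = gamma_hat @ z_new                 # least-squares prediction
    half = t_crit * np.sqrt(1.0 + z_new @ G @ z_new) * sigma_hat
    return (y_hat - half, y_hat + half)
```

For example, with $K = 0$, previous responses 1, 2, 3, 4 and `t_crit = 2.0`, the interval is centred at 2.5 with half-width $2\sqrt{25/12}$.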

Later in this section we will see that Corollary 1 implies the following 
property of the classical prediction intervals for the Gauss linear model. 

Corollary 2. Let $\epsilon \in (0,1)$. The events
$y_n \notin \Gamma^\epsilon_n$, $n = K + 3, K + 4, \dots$, are independent. In
particular, the confidence predictor (10) is strongly valid for
$n \ge K + 3$.

Remark. We have not seen Corollary 2 stated explicitly in the literature, but
some closely related facts are known. Lemma 1 in Brown, Durbin and Evans
(1975) asserts that (9) with $\hat\sigma_{n-1}$ removed are independent
$N(0, \sigma^2)$ random variables; this can be used for prediction when the
standard deviation $\sigma$ is known. Seillier-Moiseiwitsch [(1993), Example
1] shows that the statistics $T_n$ are independent when $K = 0$. It is
interesting that both papers use the independence of $T_n$ for testing rather
than for prediction.

Let us now see that some conformal predictor outputs the classical pre- 
diction intervals (10). This will demonstrate that Corollary 2 is indeed a 
special case of Corollary 1. 

The ATTS statistics for the Gauss linear model are

$$\left( x_1, \dots, x_n, \sum_{i=1}^n y_i, \sum_{i=1}^n y_i x_i, \sum_{i=1}^n y_i^2 \right).$$

(It is natural to have $x_1,\dots,x_n$ as components of $S_n$, although they
are superfluous under our original definition, in which $x_1, x_2, \dots$ are
deterministic.) The prediction intervals (10) are precisely the prediction
regions output by the conformal predictor corresponding to the nonconformity
measure

$$A_n(S_{n-1}(x_1,y_1,\dots,x_{n-1},y_{n-1}),(x_n,y_n)) := \frac{|y_n - \hat y_n|}{\sqrt{1 + z_n'(Z_{n-1}'Z_{n-1})^{-1}z_n}\,\hat\sigma_{n-1}} \tag{11}$$

[cf. (9); the goodness of the definition follows from the formulas given at
the beginning of this section]. The expression on the right-hand side of (11)
can be replaced by other natural expressions, such as $|y_n - \hat y_n|$. See
Vovk, Gammerman and Shafer (2005), Section 8.5, for further details.

Fig. 4. The median-accuracy plot for the classical prediction intervals.

According to our general convention, the conformal predictor (10) is called 
the Gauss predictor (although its discoverer was Fisher rather than Gauss). 

We have already mentioned that the classical confidence predictor, given by
(10), does not work when there are many parameters; in particular, it is
required that $n \ge K + 3$. Theorem 2 shows that there is hardly any way to
use the knowledge that the first 10 explanatory variables are the important
ones without abandoning the Gauss linear model: no weakly valid confidence
predictor in a very wide and natural class can produce informative prediction
intervals unless $n \ge K + 3$. Indeed, since the conditional distribution of
the first $n$ observations given $S_n$ is concentrated at one point for
$n \le K + 1$ and at two points for $n = K + 2$ with probability one, no
conformal predictor and, therefore, no weakly valid invariant confidence
predictor can give a bounded prediction region $\Gamma^\epsilon_n$ for
$\epsilon < 0.5$ and $n \le K + 2$.

Remark. A common reaction to the importance of the condition $n \ge K + 3$ is
that one can use only a subset of explanatory variables when $n < K + 3$. We
are, however, interested in confidence predictors that are valid
under the Gauss linear model (1), not under some other model that is only 
"approximately true," in some ill-defined sense. 

Figure 4 gives the median-accuracy plot for the confidence predictor (10); 
the predictor works very well soon after the number of observations reaches 
K + 3 = 103. Since the median is plotted, the good quality of the prediction 
intervals shows only from n = 205: indeed, for n < 205 at least half of the 
observed prediction intervals are infinitely wide. 
7. The MVA model. Remember that the MVA model assumes, besides (1), that
$x_n$ are generated independently from the same unknown multivariate Gaussian
distribution on $\mathbb{R}^K$, with the noise random variables
$\xi_1, \xi_2, \dots$ independent of $x_1, x_2, \dots$. The ATTS statistics in
the MVA model are

$$\left( \sum_{i=1}^n x_i, \sum_{i=1}^n y_i, \sum_{i=1}^n x_i x_i', \sum_{i=1}^n y_i x_i, \sum_{i=1}^n y_i^2 \right);$$

equivalently, the ATTS statistics can be defined to be the empirical means
and covariances of all variables, that is, the response and the explanatory
variables.

Let $y := y_n$, $Z := Z_n$, $K' := K'_n$ and $U$ be as in Section 5. Suppose
the value of the statistic $S_n$ is known. The vector of residuals (5) can now
be written as

$$e := y - U(U'U + aI)^{-1}U'y = y - Uc, \tag{12}$$

where $c := (U'U + aI)^{-1}U'y$ is a known vector. Since the joint
distribution of $y$ and the nondummy columns of $U$ is invariant with respect
to rotations around the vector 1, the distribution of $e$ will also be
invariant with respect to such rotations. It might help the reader's intuition
to notice that knowing the value of $S_n$ is equivalent to knowing the lengths
of and the angles between the following $K + 2$ vectors: the $K + 1$ columns
of $Z$ and $y$.

In the rest of this section we will assume $n \ge 3$ (with arbitrary
conventions for $n = 1, 2$). Let $e_1, \dots, e_n$ be the components of the
vector (12) of residuals and $\bar e_{n-1}$ be the average of
$e_1, \dots, e_{n-1}$. A standard statistical result [Fisher (1925)] allows us
to conclude that

$$\sqrt{\frac{n-1}{n}} \, \frac{e_n - \bar e_{n-1}}{\sqrt{\frac{1}{n-2}\sum_{i=1}^{n-1}(e_i - \bar e_{n-1})^2}} \tag{13}$$

has the $t$-distribution with $n - 2$ degrees of freedom.

Let us see how to implement the conformal predictor corresponding to the
nonconformity measure

$$A_n(S_{n-1}(x_1,y_1,\dots,x_{n-1},y_{n-1}),(x_n,y_n)) := \frac{|e_n - \bar e_{n-1}|}{\sqrt{\sum_{i=1}^{n-1}(e_i - \bar e_{n-1})^2}}, \tag{14}$$

which is proportional to (13) in absolute value; the fact that the right-hand
side of (14) depends on the first $n - 1$ observations only through the value
of $S_{n-1}$ can be seen from the representation (12), where $c$ is a known
vector. First we replace the true value $y_n$ by a variable $y$ ranging over
$\mathbb{R}$. Each residual becomes a linear [according to (12), where $c$
also depends on $y$] function $e_i(y)$ of $y$, and the prediction region can
be written as

$$\Gamma^\epsilon_n = \left\{ y \in \mathbb{R} : \sqrt{\frac{n-1}{n}} \, \frac{|e_n(y) - \bar e_{n-1}(y)|}{\sqrt{\frac{1}{n-2}\sum_{i=1}^{n-1}(e_i(y) - \bar e_{n-1}(y))^2}} < t^{\epsilon/2}_{n-2} \right\}.$$

The inequality in this formula is quadratic in $y$, so $\Gamma^\epsilon_n$ is
easy to find. We can see that the prediction region for $y_n$ is an interval
(empirically, this is the typical case), the union of two rays, the empty set,
or the whole real line.
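The case analysis for the quadratic inequality can be sketched as follows. This is an illustration only: the function names are hypothetical, the residuals are assumed to be given as $e_i(y) = A_i + B_i y$, the critical value `t_crit` of the $t$-distribution with $n - 2$ degrees of freedom is supplied by the caller, and the inequality is squared and cleared of its denominator, which presumes that the sum of squares in the denominator is positive.

```python
import math

def solve_quadratic_lt_zero(a2, a1, a0):
    """Open intervals on which a2*y^2 + a1*y + a0 < 0 (the list may be empty)."""
    inf = math.inf
    if a2 == 0:                         # degenerate case: linear or constant
        if a1 == 0:
            return [(-inf, inf)] if a0 < 0 else []
        root = -a0 / a1
        return [(-inf, root)] if a1 > 0 else [(root, inf)]
    disc = a1 * a1 - 4 * a2 * a0
    if disc <= 0:                       # no sign change (a double root is ignored)
        return [] if a2 > 0 else [(-inf, inf)]
    r1, r2 = sorted(((-a1 - math.sqrt(disc)) / (2 * a2),
                     (-a1 + math.sqrt(disc)) / (2 * a2)))
    # Upward parabola: negative between the roots; downward: negative outside.
    return [(r1, r2)] if a2 > 0 else [(-inf, r1), (r2, inf)]

def mva_region(A, B, t_crit):
    """Region for residuals e_i(y) = A[i] + B[i]*y; the last index is the new one."""
    n = len(A)
    a_bar = sum(A[:-1]) / (n - 1)       # average of the first n-1 residuals,
    b_bar = sum(B[:-1]) / (n - 1)       # itself linear in y
    p, q = A[-1] - a_bar, B[-1] - b_bar # e_n(y) - average = p + q*y
    u = [a - a_bar for a in A[:-1]]     # e_i(y) - average = u_i + v_i*y
    v = [b - b_bar for b in B[:-1]]
    c = (n - 1) * (n - 2) / n
    t2 = t_crit ** 2
    a2 = c * q * q - t2 * sum(x * x for x in v)
    a1 = 2 * (c * p * q - t2 * sum(x * y for x, y in zip(u, v)))
    a0 = c * p * p - t2 * sum(x * x for x in u)
    return solve_quadratic_lt_zero(a2, a1, a0)
```

The four outcomes listed in the text correspond to the sign of the leading coefficient combined with the sign of the discriminant; the linear case is a boundary situation handled separately here.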

Replacing $\Gamma^\epsilon_n$ by $\operatorname{co}\Gamma^\epsilon_n$ in the
conformal predictor we have just defined gives the MVA predictor. Our
experiments with the artificial data set of Section 4 are carried out as
before [cf. (8)]: $U$ is defined as the first 11 columns of $Z$ if $n < 103$
and as the full $Z$ otherwise.

The median-accuracy plot for the MVA predictor and our artificial data set is
shown in Figure 5. Before the threshold 103 the predictor quickly learns
$\alpha$ and the first 10 parameters and its performance more or less
stabilizes before quickly improving again when it starts learning the other
parameters from $n = 103$ onward; the second improvement in the performance
shows on the median-accuracy plot from $n = 205$.

The performance of the MVA predictor is better than the performance of any
other confidence predictor considered in this paper. Of course, this should
not be taken to mean that the other predictors are worse. Different predictors
are based on different information about the data set. None of the predictors
"knows" that the components of $x_n$ are realizations of independent standard
Gaussian random variables; even the MVA model, the narrowest model considered
in this paper, allows arbitrary means of and arbitrary correlations between
different explanatory variables for the same observation. The Gauss predictor
does not know that the $x_n$ are IID and Gaussian. The IID predictor only
knows that the observations $(x_n, y_n)$ are IID, and the IID-Gauss predictor,
introduced in the next section, knows, in addition, that the $y_n$ are
generated by (1).

The median-accuracy plot for each of the four predictors is essentially
determined by that for the MVA predictor and the threshold for the
corresponding model as shown in Table 1. It is convenient to represent each
line on a median-accuracy plot as the function that maps each value for the
accuracy in the interval $[0, 150]$ to the first step at which that accuracy
is achieved (so the graph of this function is obtained by rotating the page by
90° counterclockwise). Each of the three functions in Figure 2 is,
approximately, the maximum of $2\lceil 1/\epsilon \rceil$ and the
corresponding function in Figure 5. Similarly, each of the three functions in
Figure 4 is, approximately, the maximum of $2(K + 3) = 206$ and the
corresponding function in Figure 5. As usual, the factor of 2 appears because
of the use of median in our accuracy plots.

Fig. 5. The median-accuracy plot for the MVA predictor.

8. The IID-Gauss model. As defined in Section 1, the IID-Gauss model is the
combination of the Gauss linear and IID models: we assume both that the
observations are IID and that the responses are generated by (1) with
$\xi_1, \xi_2, \dots$ independent of $x_1, x_2, \dots$. Correspondingly, the
ATTS statistics are



Using the nonconformity measure (6) and replacing the prediction regions 
output by the corresponding conformal predictor with their convex hulls, 
we obtain the IID-Gauss predictor. Its performance on our usual data set 
is shown in Figure 6. We do not know whether the IID-Gauss predictor can 
be implemented efficiently, and Figure 6 was produced using Monte-Carlo 
sampling from the conditional distributions given S n . However, comparing 
Figure 6 to Figures 2 (to the left of n = 205) and 4 (to the right of n = 205), 
we can see that the following simple confidence predictor will work almost 
as well as the IID-Gauss predictor on our data set: predict using the IID 
predictor if $n < 103$ and predict using the Gauss predictor if $n \ge 103$.
As in all other cases in this paper where the threshold $n = K + 3 = 103$
appears, the best switch-over point will be slightly greater than $K + 3$, but
the question of when exactly to switch is outside the scope of this paper.

Fig. 6. The median-accuracy plot for the IID-Gauss predictor.

Remark. The IID predictor and the IID-Gauss predictor use the same
nonconformity measure, (6), but still produce very different median-accuracy
plots at confidence level 99.5%. This happens because of the conditioning on
the event $S_n^{\mathrm{rnd}} = S_n^{\mathrm{obs}}$ in the definition (2).
Since the ATTS statistics perform more radical data compression in the case of
the IID-Gauss model, the achievable values of
$\mathbb{P}(A_n^{\mathrm{rnd}} \ge A_n^{\mathrm{obs}} \mid S_n^{\mathrm{rnd}} = S_n^{\mathrm{obs}})$
[corresponding to (2) with $\tau_n := 1$] are much smaller than the $1/n$
achievable under the IID model.

As in the previous section, there is a close connection between Figures 5 and
6: each of the three functions in Figure 6 is, approximately, the maximum of
$2\min(\lceil 1/\epsilon \rceil, K + 3)$ and the corresponding function in
Figure 5. The distributive law of max over min now implies that each of the
three functions in Figure 6 is the minimum of the corresponding functions in
Figures 2 and 4.
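The appeal to the distributive law can be spelled out in one line. Here $a := \lceil 1/\epsilon \rceil$, $b := K + 3$ and $f$ (the corresponding function in Figure 5) are shorthand introduced only for this illustration, and the identities hold pointwise in the accuracy argument of $f$:

```latex
\[
  \max\bigl(2\min(a, b),\, f\bigr)
  = \max\bigl(\min(2a, 2b),\, f\bigr)
  = \min\bigl(\max(2a, f),\, \max(2b, f)\bigr).
\]
```

The two terms under the final minimum are the approximations given in Section 7 for the functions in Figures 2 and 4, respectively.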

9. On-line protocol, part II. In this section we will briefly discuss the 
relation of our results about the IID model to Wilks's nonparametric pre- 
diction intervals and mention some relaxations of the on-line protocol. 

The univariate IID model. The construction of prediction and tolerance
intervals in the univariate IID model, which says that $y_1, y_2, \dots$ form
an IID sequence, was undertaken by many authors following the pioneering paper
by Wilks (1941). Wilks's work was later extended to the multivariate case:
see, for example, Fraser (1957); this extension, however, is not directly
related to our IID predictors. For simplicity, let us assume in this
subsection, as is customary in the literature, that the distribution of one
observation is continuous. Correspondingly, we will assume that the realized
values of $y_n$, $n = 1, 2, \dots$, are all different.

For each $n = 1, 2, \dots$, define $T_n \in \{1, 2, \dots, n\}$ as the
smallest $i$ such that $y_n < y_{(n-1,i)}$, where
$y_{(n-1,1)}, \dots, y_{(n-1,n-1)}$ is the sequence of the first $n - 1$
observations $y_1, \dots, y_{n-1}$ sorted in the ascending order; if
$y_n > y_{(n-1,n-1)}$, set $T_n := n$. Each $T_n$ is a "pivot," being
distributed uniformly on the set $\{1, \dots, n\}$. Wilks suggested the
following prediction intervals based on this fact: fix a number
$r \in \{1, 2, \dots\}$ and define $\Gamma^{2r/n}_n$,
$n = 2r + 1, 2r + 2, \dots$, to be the interval
$(y_{(n-1,r)}, y_{(n-1,n-r)})$; the probability of error,
$y_n \notin \Gamma^{2r/n}_n$, is then $2r/n$. Now Theorem 1 implies that the
whole random sequence $(T_1, T_2, \dots)$ has a known distribution: namely, it
is distributed according to the product $U_1 \times U_2 \times \cdots$ of the
uniform distributions $U_n$ on $\{1, \dots, n\}$. In particular, Wilks's
prediction intervals $\Gamma^{2r/n}_n$, $n = 2r + 1, 2r + 2, \dots$, lead to
independent errors.
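The pivot $T_n$ and Wilks's interval can be illustrated with a short sketch; the function names are hypothetical and the code assumes, as in the text, that all observed values are different.

```python
def rank_pivot(y_prev, y_new):
    """T_n: the smallest i with y_new < y_(n-1,i), or n if y_new exceeds them all."""
    ordered = sorted(y_prev)
    for i, value in enumerate(ordered, start=1):
        if y_new < value:
            return i
    return len(ordered) + 1

def wilks_interval(y_prev, r):
    """Open interval (y_(n-1,r), y_(n-1,n-r)) built from the n-1 previous values."""
    n = len(y_prev) + 1
    if n < 2 * r + 1:
        raise ValueError("Wilks's interval requires n >= 2r + 1")
    ordered = sorted(y_prev)
    # Order statistics are 1-based in the text, list indices are 0-based.
    return ordered[r - 1], ordered[n - r - 1]
```

An error occurs exactly when the pivot falls among the $r$ smallest or the $r$ largest of its $n$ possible values, which is where the error probability $2r/n$ comes from.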
Relaxations of the on-line protocol. This paper concentrates on the on-line
prediction protocol. Smoothed conformal predictors lead to independent errors
in the on-line protocol, and Theorem 2 suggests that conformal predictors are
the most natural weakly valid confidence predictors. This is why we included
the requirement of independence in the definition of strong validity, despite
the fact that the error frequency can be shown to approach the error
probability $\epsilon$ with probability approaching one even when the
requirement of independence is relaxed in certain ways.

The situation changes when we move outside the on-line protocol. The on-line
protocol is natural, but in one respect it is overly restrictive: the true
response $y_n$ becomes known before the prediction for the next response
$y_{n+1}$ is made. It can be shown that the error frequency will still
converge to $\epsilon$ if the true response is only given for a small fraction
of observations, and even for those observations it can be given with a delay
[Vovk, Gammerman and Shafer (2005), Section 4.3; see also Vanderlooy, van der
Maaten and Sprinkhuizen-Kuyper (2007) for a recent empirical study]. The
independence of errors, however, will be lost (we can still have "approximate
independence," but this is a much more elusive notion than ordinary
independence).

10. Conclusion. In this paper we considered the problem of prediction in 
three main regression models. One of these models, the Gauss linear model, 
is the standard textbook one. The MVA model seems to have been some- 
what neglected, partly because of philosophical reasons: according to the 
conditionality principle [Cox and Hinkley (1974), Section 2.3(iii)] one should 
condition on the observed values of the explanatory variables to make the 
prediction (or estimate, etc.) more relevant to the data at hand. In most of 
this paper we took a pragmatic approach, studying which models permit one 
to produce informative prediction intervals in different circumstances with- 
out being restricted a priori by general principles. We did use the sufficiency 
principle in our interpretation of Theorem 2, but we admit this makes the 
theorem less convincing. Surprisingly, the IID model appears to have been 
neglected in the field of regression, even in nonparametric statistics, where 
the value of this model is in principle well understood. 

APPENDIX: PROOFS OF THE THEOREMS 

In this appendix we will prove the two main results stated in this paper, 
Theorems 1 and 2. A version of Theorem 1 was proved in Section 8.7 of Vovk, 
Gammerman and Shafer (2005), but we reproduce the principal points of 
the proof to make our exposition self-contained. A special case of Theorem 2 
(namely, for the IID model) was proved in Section 2.6 of Vovk, Gammerman 
and Shafer (2005). 
Proof of Theorem 1. In this proof, $\zeta_1, \zeta_2, \dots$ will be random
observations generated by $P \in \mathcal{P}$,
$(\zeta_1, \zeta_2, \dots) \sim P$, and $\tau_1, \tau_2, \dots$ will be random
numbers, $(\tau_1, \tau_2, \dots) \sim U^\infty$. For each $n = 0, 1, \dots$
let $\mathcal{G}_n$ be the $\sigma$-algebra generated by the random elements

$$S_n(\zeta_1, \dots, \zeta_n), \zeta_{n+1}, \tau_{n+1}, \zeta_{n+2}, \tau_{n+2}, \dots.$$

So $\mathcal{G}_0$ is the most informative $\sigma$-algebra and
$\mathcal{G}_0 \supseteq \mathcal{G}_1 \supseteq \mathcal{G}_2 \supseteq \cdots$.
It will be convenient to write $\mathbb{P}_{\mathcal{G}}(E)$ and
$\mathbb{E}_{\mathcal{G}}(\xi)$ for the conditional probability
$\mathbb{P}(E \mid \mathcal{G})$ and expectation
$\mathbb{E}(\xi \mid \mathcal{G})$, respectively, given a $\sigma$-algebra
$\mathcal{G}$.

Lemma A.1. For any step $n = 1, 2, \dots$ and any $\epsilon \in (0, 1)$,

$$\mathbb{P}_{\mathcal{G}_n}(p_n \le \epsilon) = \epsilon.$$

Proof. For a given value of the summary $S_n(\zeta_1, \dots, \zeta_n)$ of the
first $n$ observations, consider the conditional distribution function $F$ of
the random variable $\eta := A_n(S_{n-1}(\zeta_1, \dots, \zeta_{n-1}), \zeta_n)$
(because of the total sufficiency, it does not matter whether we further
condition on $\zeta_{n+1}, \tau_{n+1}, \zeta_{n+2}, \tau_{n+2}, \dots$).
Define $F(x-)$ to be $\sup_{t<x} F(t)$. Our task is to show that the
conditional probability of the event

$$1 - F(\eta) + \tau_n (F(\eta) - F(\eta-)) \le \epsilon \tag{A.1}$$

is $\epsilon$ [since the left-hand side of (A.1) coincides with the right-hand
side of the definition (2)]. The latter fact is usually stated in statistics
textbooks for continuous $F$ [see, e.g., Cox and Hinkley (1974), page 66], but
it is also easy to check in general. □

Lemma A.2. For any step $n = 1, 2, \dots$, $p_n$ is
$\mathcal{G}_{n-1}$-measurable.

Proof. This follows from the definition: $p_n$ is defined in terms of
$\zeta_n$, $\tau_n$ and the summary of the first $n - 1$ observations. □

Now we can easily prove the theorem. First we demonstrate that, for any
$n = 1, 2, \dots$ and any $\epsilon_1, \dots, \epsilon_n \in (0, 1)$,

$$\mathbb{P}_{\mathcal{G}_n}(p_n \le \epsilon_n, \dots, p_1 \le \epsilon_1) = \epsilon_n \cdots \epsilon_1 \quad \text{a.s.} \tag{A.2}$$

The proof is by induction on $n$. For $n = 1$, (A.2) is a special case of
Lemma A.1. For $n > 1$ we obtain, from Lemmas A.1 and A.2, standard properties
of conditional expectations, and the inductive assumption:

$$\begin{aligned}
\mathbb{P}_{\mathcal{G}_n}(p_n \le \epsilon_n, \dots, p_1 \le \epsilon_1)
&= \mathbb{E}_{\mathcal{G}_n}\bigl(\mathbb{E}_{\mathcal{G}_{n-1}}(1_{p_n \le \epsilon_n} 1_{p_{n-1} \le \epsilon_{n-1}, \dots, p_1 \le \epsilon_1})\bigr) \\
&= \mathbb{E}_{\mathcal{G}_n}\bigl(1_{p_n \le \epsilon_n} \mathbb{E}_{\mathcal{G}_{n-1}}(1_{p_{n-1} \le \epsilon_{n-1}, \dots, p_1 \le \epsilon_1})\bigr) \\
&= \mathbb{E}_{\mathcal{G}_n}(1_{p_n \le \epsilon_n} \epsilon_{n-1} \cdots \epsilon_1)
= \epsilon_n \epsilon_{n-1} \cdots \epsilon_1 \quad \text{a.s.}
\end{aligned}$$

The "tower property" of conditional expectations immediately implies

$$\mathbb{P}(p_n \le \epsilon_n, \dots, p_1 \le \epsilon_1) = \epsilon_n \cdots \epsilon_1.$$

Therefore, the distribution of the first $n$ p-values $p_1, \dots, p_n$ is
$U^n$, for all $n = 1, 2, \dots$. This implies that the distribution of the
infinite sequence $p_1, p_2, \dots$ is $U^\infty$.

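Proof of Theorem 2. In this proof, $Z := \mathbb{R}^K \times \mathbb{R}$ and
$\zeta_i$ stands for $(x_i, y_i)$. Let $n \in \mathbb{N}$.

For each summary $s \in \Sigma_n$ let $f(s)$ be the conditional probability
given $S_n(\zeta_1, \dots, \zeta_n) = s$ that $\Gamma$ makes an error at a
significance level $\epsilon$ when predicting $y_n$ from
$\zeta_1, \dots, \zeta_{n-1}$ and $x_n$, the observations
$\zeta_1, \zeta_2, \dots$ being generated from $P \in \mathcal{P}$. We know
that the expected value of $f(S_n(\zeta_1, \dots, \zeta_n))$ is $\epsilon$
under any $P \in \mathcal{P}$, and this, by the bounded completeness of $S_n$,
implies that $f(s) = \epsilon$ for almost all (under $P S_n^{-1}$ for any
$P \in \mathcal{P}$) summaries $s$. Define $E(s, \epsilon)$ to be the set of
all pairs $(s', \zeta) = (s', (x, y)) \in \Sigma_{n-1} \times Z$ such that
$F_n(s', \zeta) = s$ (where $F_n$ is the function from the definition of the
algebraic transitivity of the $S_n$) and $\Gamma$ makes an error at the
significance level $\epsilon$ when predicting $y$ and fed with
$\zeta_1, \dots, \zeta_{n-1}$ satisfying
$S_{n-1}(\zeta_1, \dots, \zeta_{n-1}) = s'$ and with $x$ (since $\Gamma$ is
invariant, whether an error is made depends only on $s'$, not on the
particular $\zeta_1, \dots, \zeta_{n-1}$). It is clear that

$$\epsilon_1 \le \epsilon_2 \Longrightarrow E(s, \epsilon_1) \subseteq E(s, \epsilon_2)$$

and

$$\mathbb{P}\bigl((S_{n-1}(\zeta_1, \dots, \zeta_{n-1}), \zeta_n) \in E(s, \epsilon) \mid S_n(\zeta_1, \dots, \zeta_n) = s\bigr) = \epsilon \quad \text{a.s.},$$

where $(\zeta_1, \zeta_2, \dots) \sim P \in \mathcal{P}$.

In this proof we say "conformity measure" to mean a nonconformity measure
which is used for computing p-values in the opposite way to (2): the ">" in
(2) is replaced by "<." Let us check that the conformal predictor $\Gamma'$
determined by the conformity measure

$$A_n(s', \zeta) := \inf\{\epsilon : (s', \zeta) \in E(F_n(s', \zeta), \epsilon)\}$$

is at least as accurate as $\Gamma$. By the monotone convergence theorem for
conditional expectations,

$$\begin{aligned}
&\mathbb{P}\bigl(A_n(S_{n-1}(\zeta_1, \dots, \zeta_{n-1}), \zeta_n) \le \epsilon \mid S_n(\zeta_1, \dots, \zeta_n) = s\bigr) \\
&\quad = \lim_{\delta \downarrow \epsilon} \mathbb{P}\bigl(A_n(S_{n-1}(\zeta_1, \dots, \zeta_{n-1}), \zeta_n) < \delta \mid S_n(\zeta_1, \dots, \zeta_n) = s\bigr) \\
&\quad \le \lim_{\delta \downarrow \epsilon} \mathbb{P}\bigl((S_{n-1}(\zeta_1, \dots, \zeta_{n-1}), \zeta_n) \in E(s, \delta) \mid S_n(\zeta_1, \dots, \zeta_n) = s\bigr) \\
&\quad = \lim_{\delta \downarrow \epsilon} \delta = \epsilon \quad \text{a.s.},
\end{aligned}$$

where $(\zeta_1, \zeta_2, \dots) \sim P \in \mathcal{P}$ and $\delta$ is
constrained to be a rational number. Therefore, at each significance level
$\epsilon$ and for all $(\zeta_1, \dots, \zeta_n) \in Z^n$,

$$\begin{aligned}
y_n \in (\Gamma')^\epsilon(\zeta_1, \dots, \zeta_{n-1}, x_n)
&\Longleftrightarrow \mathbb{P}(A_n^{\mathrm{rnd}} \le A_n^{\mathrm{obs}} \mid S_n^{\mathrm{rnd}} = S_n^{\mathrm{obs}}) > \epsilon \\
&\Longrightarrow A_n^{\mathrm{obs}} > \epsilon \\
&\Longrightarrow (S_{n-1}(\zeta_1, \dots, \zeta_{n-1}), \zeta_n) \notin E(S_n(\zeta_1, \dots, \zeta_n), \epsilon) \\
&\Longleftrightarrow y_n \in \Gamma^\epsilon(\zeta_1, \dots, \zeta_{n-1}, x_n) \quad \text{a.s.},
\end{aligned}$$

in the notation of (2) and for $(\zeta_1, \zeta_2, \dots) \sim P \in \mathcal{P}$.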
Note added in proof. The R package PredictiveRegression, available
from CRAN, implements the three prediction algorithms (IID predictor,
Gauss predictor and MVA predictor) described in this paper.